69 research outputs found
A parallel corpus of Python functions and documentation strings for automated code documentation and code generation
Automated documentation of programming source code and automated code
generation from natural language are challenging tasks of both practical and
scientific interest. Progress in these areas has been limited by the low
availability of parallel corpora of code and natural language descriptions,
which tend to be small and constrained to specific domains.
In this work we introduce a large and diverse parallel corpus of a hundred
thousands Python functions with their documentation strings ("docstrings")
generated by scraping open source repositories on GitHub. We describe baseline
results for the code documentation and code generation tasks obtained by neural
machine translation. We also experiment with data augmentation techniques to
further increase the amount of training data.
We release our datasets and processing scripts in order to stimulate research
in these areas.Comment: 5 pages, 1 figure, 3 table
Syntax-based machine translation using dependency grammars and discriminative machine learning
Machine translation underwent huge improvements since the groundbreaking
introduction of statistical methods in the early 2000s, going from very
domain-specific systems that still performed relatively poorly despite the
painstakingly crafting of thousands of ad-hoc rules, to general-purpose
systems automatically trained on large collections of bilingual texts which
manage to deliver understandable translations that convey the general
meaning of the original input.
These approaches however still perform quite below the level of human
translators, typically failing to convey detailed meaning and register, and
producing translations that, while readable, are often ungrammatical and
unidiomatic.
This quality gap, which is considerably large compared to most other
natural language processing tasks, has been the focus of the research in
recent years, with the development of increasingly sophisticated models that
attempt to exploit the syntactical structure of human languages, leveraging
the technology of statistical parsers, as well as advanced machine learning
methods such as marging-based structured prediction algorithms and neural
networks.
The translation software itself became more complex in order to accommodate
for the sophistication of these advanced models: the main translation
engine (the decoder) is now often combined with a pre-processor which
reorders the words of the source sentences to a target language word order, or
with a post-processor that ranks and selects a translation according according
to fine model from a list of candidate translations generated by a coarse
model.
In this thesis we investigate the statistical machine translation problem
from various angles, focusing on translation from non-analytic languages
whose syntax is best described by fluid non-projective dependency grammars
rather than the relatively strict phrase-structure grammars or projectivedependency
grammars which are most commonly used in the literature.
We propose a framework for modeling word reordering phenomena
between language pairs as transitions on non-projective source dependency
parse graphs. We quantitatively characterize reordering phenomena for the
German-to-English language pair as captured by this framework, specifically
investigating the incidence and effects of the non-projectivity of source
syntax and the non-locality of word movement w.r.t. the graph structure.
We evaluated several variants of hand-coded pre-ordering rules in order to
assess the impact of these phenomena on translation quality.
We propose a class of dependency-based source pre-ordering approaches
that reorder sentences based on a flexible models trained by SVMs and and
several recurrent neural network architectures.
We also propose a class of translation reranking models, both syntax-free
and source dependency-based, which make use of a type of neural networks
known as graph echo state networks which is highly flexible and requires
extremely little training resources, overcoming one of the main limitations
of neural network models for natural language processing tasks
Dialogue-based generation of self-driving simulation scenarios using Large Language Models
Simulation is an invaluable tool for developing and evaluating controllers
for self-driving cars. Current simulation frameworks are driven by
highly-specialist domain specific languages, and so a natural language
interface would greatly enhance usability. But there is often a gap, consisting
of tacit assumptions the user is making, between a concise English utterance
and the executable code that captures the user's intent. In this paper we
describe a system that addresses this issue by supporting an extended
multimodal interaction: the user can follow up prior instructions with
refinements or revisions, in reaction to the simulations that have been
generated from their utterances so far. We use Large Language Models (LLMs) to
map the user's English utterances in this interaction into domain-specific
code, and so we explore the extent to which LLMs capture the context
sensitivity that's necessary for computing the speaker's intended message in
discourse.Comment: 12 pages, 6 figures, SpLU-RoboNLP 202
Distributionally Robust Recurrent Decoders with Random Network Distillation
Neural machine learning models can successfully model language that is
similar to their training distribution, but they are highly susceptible to
degradation under distribution shift, which occurs in many practical
applications when processing out-of-domain (OOD) text. This has been attributed
to "shortcut learning": relying on weak correlations over arbitrary large
contexts.
We propose a method based on OOD detection with Random Network Distillation
to allow an autoregressive language model to automatically disregard OOD
context during inference, smoothly transitioning towards a less expressive but
more robust model as the data becomes more OOD while retaining its full context
capability when operating in-distribution. We apply our method to a GRU
architecture, demonstrating improvements on multiple language modeling (LM)
datasets.Comment: 8 pages, 1 figur
Distributionally Robust Recurrent Decoders with Random Network Distillation
Neural machine learning models can successfully model language that is similar to their training distribution, but they are highly susceptible to degradation under distribution shift, which occurs in many practical applications when processing out-of-domain (OOD) text. This has been attributed to âshortcut learningâ":" relying on weak correlations over arbitrary large contexts. We propose a method based on OOD detection with Random Network Distillation to allow an autoregressive language model to automatically disregard OOD context during inference, smoothly transitioning towards a less expressive but more robust model as the data becomes more OOD, while retaining its full context capability when operating in-distribution. We apply our method to a GRU architecture, demonstrating improvements on multiple language modeling (LM) datasets
Deep Architectures for Neural Machine Translation
It has been shown that increasing model depth improves the quality of neural
machine translation. However, different architectural variants to increase
model depth have been proposed, and so far, there has been no thorough
comparative study.
In this work, we describe and evaluate several existing approaches to
introduce depth in neural machine translation. Additionally, we explore novel
architectural variants, including deep transition RNNs, and we vary how
attention is used in the deep decoder. We introduce a novel "BiDeep" RNN
architecture that combines deep transition RNNs and stacked RNNs.
Our evaluation is carried out on the English to German WMT news translation
dataset, using a single-GPU machine for both training and inference. We find
that several of our proposed architectures improve upon existing approaches in
terms of speed and translation quality. We obtain best improvements with a
BiDeep RNN of combined depth 8, obtaining an average improvement of 1.5 BLEU
over a strong shallow baseline.
We release our code for ease of adoption.Comment: WMT 2017 research trac
- âŠ